How do we know if water quality meets environmental standards?
What sample size do we need for reliable biodiversity surveys?
How can we predict extreme climate events?
Today’s Journey
Normal Distribution
Environmental patterns
Probability calculations
Sampling Distributions
From samples to populations
Central Limit Theorem
Making reliable predictions
Why This Matters
Common Environmental Applications
Water quality monitoring
Species population assessment
Climate variation analysis
Pollution level compliance
Common Confusions to Avoid
Watch Out For These!
Population vs Sample
Population parameters (μ, σ) are usually unknown
We estimate them from sample statistics (\bar{x}, s)
Example: All possible stream temperatures vs our measurements
Distribution Shape vs Mean
Same mean doesn’t mean same distribution
Need to consider spread and shape
Example: Two sites with same average temperature but different variability
Sample Size Effects
Larger samples = Better estimates
But how large is “large enough”?
Depends on how variable your data is
Learning Outcomes
Understand what a (probability) distribution is, and the properties of a continuous distribution;
Use Normal Distribution to understand/describe data
Be able to standardise a Normal;
Calculate probabilities based on Normal Distribution using R.
Know that there are other continuous distributions useful in hypothesis testing.
Distinguish between population, sample and sampling distributions;
Distinguish between a standard deviation and standard error of the mean;
Describe the Central Limit Theorem;
Use R and Excel to calculate the standard error and probabilities associated with sampling distributions;
Types of data
Numerical
Continuous: yield, weight
Discrete: weeds per m^2
Categorical
Binary: 2 mutually exclusive categories
Ordinal: categories ranked in order
Nominal: qualitative data
Example
The gestation period (in days) for American Simmental cattle is normally distributed with mean 284.3 and standard deviation 5.52. How often is a calf born a week early?
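As a preview of the calculations later in this lecture, R's `pnorm()` answers this directly. A minimal sketch, assuming "a week early" means a gestation of at most 284.3 - 7 = 277.3 days:

```r
# Gestation X ~ N(284.3, 5.52^2); "a week early" = at most 277.3 days
pnorm(284.3 - 7, mean = 284.3, sd = 5.52)
# ≈ 0.10, i.e. roughly 1 calf in 10 is born at least a week early
```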
In our case we are generally referring to a distribution function
This is a function (or model) that describes the probability that a system will take on a value or set of values \{x\}
For any variable X, we describe probabilities by
Discrete variables: probability distribution function P(X=x)
Continuous variables: probability density function f(x)
Discrete and continuous variables: cumulative distribution function F(x) = P(X≤x)
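A minimal sketch of these functions in R for a continuous variable: `dnorm()` gives the density f(x), and `pnorm()` gives the cdf F(x).

```r
# Density f(x) of the standard normal at x = 0: a height, not a probability
dnorm(0)   # ≈ 0.399

# cdf F(0) = P(X <= 0): by symmetry this is exactly 0.5
pnorm(0)   # 0.5
```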
Environmental Data Example: Water Temperature
In environmental science, we often need to understand the pattern of measurements to make decisions. Let’s look at stream temperature monitoring:
Temperature Monitoring Background
Daily water temperature measurements follow patterns
Understanding these patterns helps protect aquatic life
We need to assess risks of extreme temperatures
```r
# Define parameters
temp_mean <- 22   # Mean temperature in °C
temp_sd <- 1.5    # Standard deviation in °C
thresh <- 24      # Environmental threshold

# Create temperature range for plotting
temp_range <- seq(temp_mean - 4 * temp_sd, temp_mean + 4 * temp_sd, length.out = 1000)
temp_df <- data.frame(temperature = temp_range)

# Create visualization
ggplot(temp_df, aes(x = temperature)) +
  stat_function(
    fun = dnorm, args = list(mean = temp_mean, sd = temp_sd),
    color = "blue"
  ) +
  geom_vline(xintercept = thresh, linetype = "dashed", color = "red") +
  annotate("text",
    x = thresh + 0.5, y = 0.2,
    label = "Environmental\nThreshold",
    color = "red", hjust = 0
  ) +
  labs(
    title = "Stream Water Temperature Distribution",
    subtitle = "Daily measurements follow a pattern we can describe",
    x = "Temperature (°C)",
    y = "Relative Frequency"
  ) +
  theme_cowplot()
```
This pattern in our data lets us:

- Predict future temperatures
- Assess risks to aquatic life
- Plan monitoring strategies
- Make management decisions
Understanding how to describe and work with these patterns is key to environmental science.
Properties of a Continuous Distribution
For any continuous distribution
There is an infinite number of possible values;
These values may lie within a fixed interval. For example, male human heights (in cm) lie in [54.6, 272].
Any specific value in a continuous distribution has probability 0. For example, the probability of a Simmental cow weighing exactly 450 kg is zero: there are infinitely many possible weights above and below 450 kg, so the probability of observing exactly 450 kg is vanishingly small, and we treat it as zero.
The total of all the probabilities must be 1 (the total area under the pdf).
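The last property can be checked numerically in R; a quick sketch using the standard normal density:

```r
# The area under a pdf over its whole range must equal 1;
# integrate() confirms this numerically for the standard normal
integrate(dnorm, lower = -Inf, upper = Inf)
# ≈ 1, with a tiny numerical error
```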
The Normal Distribution
The Normal Distribution is hugely important because it occurs everywhere! It naturally describes many natural phenomena and is a great model for the sample mean.
It is a symmetric bell-shaped variable with two parameters \mu and \sigma^2 such that:
X\sim{N(\mu,\sigma^2)}
The Standard Normal Curve
The standard normal curve is one where the mean = 0, and variance = 1
X\sim{N(\mu=0,\sigma^2=1)}
```r
# Create a sequence of values for the x-axis
x_values <- seq(-4, 4, by = 0.01)

# Create a data frame to hold these values
data_frame <- data.frame(x_values)

# Plot the standard normal curve
ggplot(data_frame, aes(x = x_values)) +
  stat_function(
    fun = dnorm, args = list(mean = 0, sd = 1),
    color = "blue"
  ) +
  labs(
    title = "Standard Normal Curve",
    x = "Z-Score",
    y = "Density"
  )
```
The General Normal Curve
Simmental cattle gestation times…
```r
# Define parameters
mean <- 284.3
sd <- 5.52

# Create data and plot
x_values <- seq(mean - 4 * sd, mean + 4 * sd, length.out = 1000)
df <- data.frame(x = x_values)

ggplot(df, aes(x = x_values)) +
  stat_function(
    fun = dnorm, args = list(mean = mean, sd = sd),
    color = "blue"
  ) +
  labs(
    title = "Normal Distribution Curve",
    x = "Gestation Period (days)",
    y = "Density"
  )
```
The General Normal Distribution
If X\sim{N(\mu,\sigma^2)}
PDF
f(x | \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} for x \in (-\infty,\infty)
CDF
F(x)=P(X\le x)=\int_{-\infty}^x f(y)dy
Types of Normal Probabilities
There are three types of probabilities that we are interested in:
Tail probabilities (lower and upper), i.e. cumulative probabilities;
Interval probabilities;
Inverse probabilities.
Normal distribution in R
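All three types can be computed with `pnorm()` and `qnorm()`. A sketch using the gestation distribution X ~ N(284.3, 5.52²) from earlier; the 280- and 290-day cut-offs are illustrative choices:

```r
mu <- 284.3
sigma <- 5.52

# Lower tail: P(X <= 275)
pnorm(275, mu, sigma)                          # ≈ 0.046

# Upper tail: P(X > 290)
pnorm(290, mu, sigma, lower.tail = FALSE)      # ≈ 0.151

# Interval: P(280 <= X <= 290)
pnorm(290, mu, sigma) - pnorm(280, mu, sigma)  # ≈ 0.63

# Inverse: find x such that P(X <= x) = 0.95
qnorm(0.95, mu, sigma)                         # ≈ 293.4 days
```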
Types of Normal Probabilities in Environmental Science
Lower Tail: Early Warning Thresholds
Water Temperature Example: P(T\le 20) - risk of cold stress
Gestation Example: P(X\le 275) - early birth monitoring

- 9.1% chance of temperatures below the cold stress threshold
- 4.6% chance of early birth (before 275 days)
These lower tail probabilities help us:

- Identify risks of extreme events
- Plan monitoring and intervention strategies
- Set appropriate warning thresholds
- Make evidence-based management decisions
- 49.5% of temperature readings should fall in optimal range
- 33.2% of births expected during normal period
Management Applications:

1. Setting monitoring frequency
2. Resource allocation planning
3. Early intervention thresholds
4. Performance benchmarking
This analysis shows how interval probabilities help:

- Define normal operating conditions
- Set realistic expectations
- Plan resource allocation
- Design monitoring programs
Finding Critical Values for Management
When designing monitoring programs, we often need to find values that capture specific probabilities:
What temperature should trigger interventions? P(T \le x)=0.9
When should we flag delayed births? P(X \le x)=0.95
```r
# Calculate critical values
temp_90 <- qnorm(0.9, temp_mean, temp_sd)
gest_95 <- qnorm(0.95, 284.3, 5.52)

# Range of gestation days for plotting (used by gest_plot below)
gest_df <- data.frame(days = seq(284.3 - 4 * 5.52, 284.3 + 4 * 5.52, length.out = 1000))

# Create temperature plot with critical value
temp_plot <- ggplot(temp_df, aes(x = temperature)) +
  stat_function(
    fun = dnorm, args = list(mean = temp_mean, sd = temp_sd),
    color = "blue"
  ) +
  geom_area(
    data = subset(temp_df, temperature <= temp_90),
    aes(y = dnorm(temperature, temp_mean, temp_sd)),
    fill = "purple", alpha = 0.3
  ) +
  geom_vline(xintercept = temp_90, linetype = "dashed", color = "purple") +
  annotate("text",
    x = temp_90, y = 0.2,
    label = sprintf("90th Percentile\n%.1f°C", temp_90),
    color = "purple", hjust = -0.1
  ) +
  labs(
    title = "Water Temperature Critical Value",
    subtitle = "90% of readings fall below this threshold",
    x = "Temperature (°C)",
    y = "Density"
  )

# Create gestation plot with critical value
gest_plot <- ggplot(gest_df, aes(x = days)) +
  stat_function(
    fun = dnorm, args = list(mean = 284.3, sd = 5.52),
    color = "blue"
  ) +
  geom_area(
    data = subset(gest_df, days <= gest_95),
    aes(y = dnorm(days, 284.3, 5.52)),
    fill = "purple", alpha = 0.3
  ) +
  geom_vline(xintercept = gest_95, linetype = "dashed", color = "purple") +
  annotate("text",
    x = gest_95, y = 0.05,
    label = sprintf("95th Percentile\n%.1f days", gest_95),
    color = "purple", hjust = -0.1
  ) +
  labs(
    title = "Gestation Period Critical Value",
    subtitle = "95% of births occur before this time",
    x = "Days",
    y = "Density"
  )

# Display plots side by side
gridExtra::grid.arrange(temp_plot, gest_plot, ncol = 2)
```
- Consider interventions when temperature exceeds 23.9°C
- Investigate if gestation exceeds 293.4 days

Applications:

1. Setting monitoring thresholds
2. Designing intervention protocols
3. Resource planning
4. Risk assessment
These critical values help establish evidence-based management protocols and early warning systems.
Connecting Probability to Sampling
We've seen how to:

- Calculate various types of probabilities
- Work with environmental thresholds
- Make evidence-based decisions
But in real-world monitoring, we rarely know the true population parameters. Instead:

- We take samples
- Calculate sample statistics
- Use these to make inferences
Key Questions
This leads us to two crucial questions:

1. How do sample means behave?
2. How reliable are our probability calculations with sample data?
The Central Limit Theorem will help us answer these questions…
Progress Check ✓
Let’s review what we’ve learned about probability calculations:
We can calculate different types of probabilities:
Lower tail: P(X \leq x) - early warning
Upper tail: P(X \geq x) - critical thresholds
Interval: P(a \leq X \leq b) - normal ranges
Inverse: Finding x for given probability
These help us:
Assess environmental risks
Set monitoring thresholds
Make evidence-based decisions
Plan interventions
Questions to consider:
How do these calculations change with sample data?
What happens to probabilities as sample size changes?
Example
Let's return to our example for American Simmental cattle, where X \sim N(284.3, 5.52^2).
What is the probability of a gestation time less than 275 days?
So we need to calculate the lower tail probability: P(X \le 275)
```r
# Calculate probability of early gestation
# P(X ≤ 275) where X ~ N(284.3, 5.52²)
pnorm(275, 284.3, 5.52)
```
[1] 0.04601526
One would expect around 5% of gestation times to be less than 275 days.
Question for you: Why might this be important, and how can we use these results?
Back to the Standard Normal Curve
Sometimes it is useful to standardise "data", as this allows us to compare samples drawn from populations that may have different means and standard deviations.
Luckily for us, we can standardise any general normal distribution X\sim{N(\mu,\sigma^2)} to a standard normal distribution Z\sim{N(0,1)}.
This was also useful because a set of standard normal tables could be used to calculate probabilities (before computers were readily available).
```r
# Calculate P(X ≤ 14) where X ~ N(10, 9)

# Method 1: Direct calculation
prob1 <- pnorm(14, 10, 3)

# Method 2: Using standardized value
prob2 <- pnorm(4/3, 0, 1)

# Method 3: Using standardized value (default parameters)
prob3 <- pnorm(4/3)

# Display results
c(direct = prob1, standardized = prob2, default = prob3)
```
direct standardized default
0.9087888 0.9087888 0.9087888
Percentiles of the Standard Normal Curve
Within 1 standard deviation of the mean: ≈ 68% of the data
Within 2 standard deviations of the mean: ≈ 95% of the data
Within 3 standard deviations of the mean: ≈ 99.7% of the data
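These percentages (the 68-95-99.7 rule) are easy to verify with `pnorm()`:

```r
# Area within k standard deviations of the mean for a standard normal
pnorm(1) - pnorm(-1)   # ≈ 0.683
pnorm(2) - pnorm(-2)   # ≈ 0.954
pnorm(3) - pnorm(-3)   # ≈ 0.997
```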
Not so Normal Distributions
Student T: The Student T distribution models a symmetric bell-shaped variable with thicker tails than a Normal.
We say the variable X \sim t_n, with n degrees of freedom.
It has an extra parameter, n (the degrees of freedom), which is related to the sample size
The T distribution is used for the one- and two-sample t-tests, which are really important in the next few weeks.
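A quick sketch of the thicker tails using R's `pt()` (the t cdf): for small degrees of freedom the t distribution puts noticeably more probability in the tails than the normal, and the difference shrinks as the degrees of freedom grow.

```r
# Upper-tail probability P(T > 2) for t distributions vs the normal
1 - pt(2, df = 5)     # ≈ 0.051: heavier tail with few degrees of freedom
1 - pt(2, df = 100)   # ≈ 0.024: close to the normal for large df
1 - pnorm(2)          # ≈ 0.023: the normal upper tail
```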
Not so Normal Distributions
The Chi-Squared distribution models a variable which can only take positive values and is skewed in distribution.
We say the variable X \sim \chi_n^2, with n degrees of freedom.
The Chi-Squared distribution is used for the Chi-Squared Test which you will cover in the next few weeks.
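Chi-squared probabilities follow the same R pattern, with `pchisq()` and `qchisq()`; a minimal sketch:

```r
# P(X <= 3.84) for a chi-squared variable with 1 degree of freedom
pchisq(3.84, df = 1)   # ≈ 0.95, so 3.84 is the familiar 5% critical value

# Inverse: the value below which 95% of a chi-squared(2) variable falls
qchisq(0.95, df = 2)   # ≈ 5.99
```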
Sampling distributions
Rye grass root growth (in mg dry weight) follows the distribution X \sim N(300, 50^2).
One measurement is taken: how likely is it that the dry weight exceeds 320 mg?
10 measurements are taken: how likely is it that the sample mean exceeds 320 mg?
Sampling distributions
Here, we are dealing with 2 distributions:
Measurement: X \sim N(300,50^2)
Sample Mean of 10 measurements: \overline X = \frac{1}{10}\Sigma_{i=1}^{10} X_i \sim ...
How does the sampling distribution occur?
http://onlinestatbook.com/stat_sim/sampling_dist/
We have a population X
We take a sample of size n and we calculate the mean \overline x_1
We take another sample of size n and we calculate the mean \overline x_2
We take another sample of size n and we calculate the mean \overline x_3 … If we sample all possibilities, then the sampling distribution of \overline X = \frac{1}{10}\Sigma_{i=1}^{10} X_i is the distribution of \{\overline x_1, \overline x_2, \overline x_3,...\}
Distribution for a sample mean
if X\sim{N(\mu,\sigma^2)}
then \overline X\sim{N(\mu,\frac{\sigma^2}{n})}
Note that we call
\sigma the standard deviation such that sd(X)=\sigma, and
\sigma/\sqrt n the standard error such that sd(\overline X)=\sigma/\sqrt n
The standard error is important for making inference about the population, i.e. how close your sample mean \overline x is likely to be to the population mean \mu
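For the rye grass example that follows (X ~ N(300, 50²), n = 10), the two quantities look like this:

```r
# Standard deviation of one measurement vs standard error of the mean
sigma <- 50
n <- 10

sigma            # sd(X): spread of single measurements
sigma / sqrt(n)  # sd(X-bar), the standard error: ≈ 15.8
```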
Example
Rye grass root growth (in mg dry weight) follows the distribution X \sim N(300,50^2).
One measurement is taken: how likely is it that the dry weight exceeds 320 mg?
10 measurements are taken: how likely is it that the sample mean exceeds 320 mg?
```r
# Calculate probability of sample mean exceeding 320 mg
# P(X̄ > 320) where X̄ ~ N(300, 50²/10)
# Standardized to P(Z > 1.26)
1 - pnorm(1.26)
```
[1] 0.1038347
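Equivalently, `pnorm()` can take the mean and standard error directly, which also makes the contrast with a single measurement explicit (the small difference from 0.1038 above comes from rounding z to 1.26):

```r
# Single measurement: P(X > 320), X ~ N(300, 50^2)
1 - pnorm(320, mean = 300, sd = 50)              # ≈ 0.345

# Sample mean of n = 10: P(X-bar > 320), X-bar ~ N(300, 50^2 / 10)
1 - pnorm(320, mean = 300, sd = 50 / sqrt(10))   # ≈ 0.103
```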
The Central Limit Theorem in Environmental Monitoring
Why This Matters
In environmental science, we often face:

- Non-normal data (e.g., pollution levels, species counts)
- Need to aggregate multiple measurements
- Want to make reliable inferences
The Central Limit Theorem (CLT) tells us that:

- Sample means follow a normal distribution
- Regardless of the shape of the original distribution
- The larger the sample size, the closer to normal the sample means become
Requirements for CLT
For reliable results, we need:

1. Independent random samples
2. A large enough sample size:
   - n > 30 for skewed data (e.g., pollution levels)
   - n > 15 for symmetric data (e.g., temperature readings)
3. A finite variance
Environmental Applications
Let’s see this in action with environmental data…
Example: Water Quality Monitoring
Consider daily pollution measurements:

- Often right-skewed (many low values, few high spikes)
- Single readings can be misleading
- Need to understand behavior of sample means
CLT in Environmental Monitoring
Let’s demonstrate how the CLT works with real environmental data:
```r
# Set random seed for reproducibility
set.seed(123)

# Define parameters
number_of_samples <- 1000
sample_sizes <- c(5, 10, 30, 50)
distributions <- list(
  "Normal" = rnorm,
  "Exponential" = rexp,
  "Chi-Squared (df = 2)" = function(n) rchisq(n, df = 2)
)

# Function to generate sample means
generate_sample_means <- function(sample_size, number_of_samples, dist_function) {
  sapply(1:number_of_samples, function(x) mean(dist_function(sample_size)))
}

# Generate sample means for all combinations
sample_means_list <- lapply(distributions, function(dist_function) {
  lapply(sample_sizes, generate_sample_means,
    number_of_samples = number_of_samples,
    dist_function = dist_function
  )
})

# Convert to data frame
sample_means_df <- do.call(rbind, lapply(names(distributions), function(dist_name) {
  do.call(rbind, lapply(1:length(sample_sizes), function(i) {
    data.frame(
      Distribution = dist_name,
      Sample_Size = sample_sizes[i],
      Sample_Mean = sample_means_list[[dist_name]][[i]]
    )
  }))
}))
```
Central Limit Theorem in Action
Environmental Monitoring Examples
We'll demonstrate the CLT using three types of environmental data:

1. Stream temperatures (normally distributed)
2. Air pollution levels (right-skewed)
3. Species counts (discrete data)
Visualizing the CLT
Let’s see how sample means behave as sample size increases:
```r
# Rename distributions to environmental context
sample_means_df <- sample_means_df %>%
  mutate(Distribution = case_when(
    Distribution == "Normal" ~ "Stream Temperature",
    Distribution == "Exponential" ~ "Air Pollution",
    Distribution == "Chi-Squared (df = 2)" ~ "Species Abundance",
    TRUE ~ Distribution
  ))

# Enhanced visualization
ggplot(sample_means_df, aes(x = Sample_Mean)) +
  # Add density estimate
  geom_density(color = "red", linewidth = 1) +
  # Add histogram with improved aesthetics
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    fill = "steelblue",
    alpha = 0.7,
    color = "white"
  ) +
  # Facet by distribution type and sample size
  facet_grid(
    Distribution ~ Sample_Size,
    scales = "free",
    labeller = labeller(
      Distribution = c(
        "Stream Temperature" = "Temperature (°C)\nNormally Distributed",
        "Air Pollution" = "PM2.5 Levels\nRight-Skewed",
        "Species Abundance" = "Species Counts\nDiscrete Data"
      )
    )
  ) +
  # Improved labels
  labs(
    title = "Central Limit Theorem in Environmental Monitoring",
    subtitle = "Sample means approach normal distribution as sample size increases",
    x = "Sample Mean",
    y = "Density"
  ) +
  # Consistent theme
  theme_cowplot() +
  theme(
    panel.spacing = unit(1, "lines"),
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(margin = margin(b = 10))
  )
```
Implications for Environmental Monitoring:

1. Stream Temperature:
   - Even small samples (n=10) give reliable means
   - Good for continuous monitoring programs
2. Air Pollution:
   - Requires larger samples (n≥30) for reliable means
   - Important for regulatory compliance
3. Species Abundance:
   - Needs n≥30 for normal approximation
   - Critical for biodiversity assessments
Key Points for Environmental Scientists
The CLT helps us:
Design monitoring programs
Choose appropriate sample sizes
Set sampling frequencies
Balance cost and accuracy
Make reliable inferences
Estimate population parameters
Calculate confidence intervals
Test hypotheses about means
Ensure quality control
Set warning thresholds
Monitor system changes
Make evidence-based decisions
Looking Ahead: CLT and Hypothesis Testing
The CLT is fundamental to statistical inference because it tells us that:
Sample Means are Normally Distributed
Even when original data isn’t normal
Enables use of z-tests and t-tests
Supports confidence interval calculations
Standard Error Matters
Measures uncertainty in sample means
Decreases with larger sample sizes: SE = \frac{\sigma}{\sqrt{n}}
Helps determine required sample sizes
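A small sketch of how the standard error formula drives sample-size choices, using the rye grass value σ = 50 from earlier (the target precision of 10 mg is an illustrative choice, not from the example):

```r
# SE = sigma / sqrt(n): watch it shrink as n grows
sigma <- 50
n <- c(5, 10, 30, 50)
round(sigma / sqrt(n), 1)   # 22.4 15.8  9.1  7.1

# To achieve SE <= 10 mg we need n >= (sigma / 10)^2 = 25 measurements
(sigma / 10)^2
```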
Coming Up in Future Lectures
One-sample tests: Compare means to standards
Two-sample tests: Compare different treatments
ANOVA: Compare multiple groups
Example: When testing if stream temperatures exceed regulatory limits, we’ll use:
Sample means (normally distributed thanks to CLT)
Standard error (to assess uncertainty)
t-tests (based on normal distribution assumptions)
Progress Check: Probability and CLT ✓
Let’s connect what we’ve learned:
Probability Calculations
Lower tail P(X \leq x) - Early warning
Upper tail P(X \geq x) - Critical thresholds
Intervals P(a \leq X \leq b) - Normal ranges
Sample Means (CLT)
Approach normal distribution
More reliable with larger samples
Enable statistical inference
Applications
Design monitoring programs
Set evidence-based thresholds
Make reliable predictions
Key Achievement: You can now connect probability theory to practical environmental monitoring decisions! 🎯